{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true,
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# 第八章 应用机器学习于情感分析\n",
    "本章包含以下内容：\n",
    "·\t清洗和准备文本数据\n",
    "·\t根据文本数据建立特征向量\n",
    "·\t训练机器学习模型来区分正面或者负面评论\n",
    "·\t用基于外存的学习方法来处理大型文本数据集\n",
    "·\t根据文档推断主题进行分类"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "outputs": [],
   "source": [
    "#使用IMDB电影评论数据集，来源于http://ai.stanford.edu/~amaas/data/sentiment\n",
    "import pandas as pd\n",
    "import numpy as np"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "outputs": [
    {
     "data": {
      "text/plain": "   index                                             review  sentiment\n0  27632  i was greatly moved when i watched the movie.h...          1\n1  36119  The warmest, most engaging movie of its genre,...          1\n2   4796  Wow, I think the overall average rating of thi...          1\n3   3648  I liked the quiet noir of the first part, the ...          1\n4  24501  This is another Sci-Fi channel original movie ...          0",
      "text/html": "<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>index</th>\n      <th>review</th>\n      <th>sentiment</th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>0</th>\n      <td>27632</td>\n      <td>i was greatly moved when i watched the movie.h...</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>1</th>\n      <td>36119</td>\n      <td>The warmest, most engaging movie of its genre,...</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>4796</td>\n      <td>Wow, I think the overall average rating of thi...</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>3648</td>\n      <td>I liked the quiet noir of the first part, the ...</td>\n      <td>1</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>24501</td>\n      <td>This is another Sci-Fi channel original movie ...</td>\n      <td>0</td>\n    </tr>\n  </tbody>\n</table>\n</div>"
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movie_data = pd.read_csv('./data/movie_data.csv', header=0)\n",
    "np.random.seed(10)\n",
    "movie_data = movie_data.reindex(np.random.permutation(movie_data.index))\n",
    "movie_data = movie_data.reset_index()\n",
    "movie_data.head()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### 词袋模型"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'love': 4, 'my': 6, 'wife': 11, 'bin': 0, 'she': 7, 'so': 8, 'much': 5, 'can': 1, 'lose': 3, 'want': 10, 'stay': 9, 'with': 12, 'you': 13, 'forever': 2}\n",
      "[[2 0 0 0 1 0 1 0 0 0 0 1 0 0]\n",
      " [0 0 0 0 1 1 0 1 1 0 0 0 0 0]\n",
      " [0 1 0 1 0 0 1 0 0 0 0 1 0 0]\n",
      " [2 0 1 0 0 0 0 0 0 1 1 0 1 1]]\n"
     ]
    }
   ],
   "source": [
    "#统计每个单词出现的频率计算次袋模型\n",
    "from sklearn.feature_extraction.text import CountVectorizer\n",
    "\n",
    "count = CountVectorizer()\n",
    "docs = np.array(\n",
    "    [\"I love my wife bin bin\", \"I love she so much!\", \"I can't lose my wife\", \"I want stay with you forever，bin bin\"])\n",
    "bag = count.fit_transform(docs)\n",
    "print(count.vocabulary_)\n",
    "print(bag.toarray())"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "特征向量的每个索引位置对应于词汇存储在CountVectorizer字典项中的整数值。例如，索引位置0上的第一个特征等同于单词'bin'的词频，它只出现在最后一个和第一个文档中。这些值的特征向量也叫原词频率：tf（t，d），即t项在文档中出现的次数d。count.vocabulary_中每个单词对应的数字是该单词在特征向量中的位置索引。"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### 通过词频逆反文档频率评估单词的相关性"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "![](picture/评估单词的相关性.png)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[0.76 0.   0.   0.   0.38 0.   0.38 0.   0.   0.   0.   0.38 0.   0.  ]\n",
      " [0.   0.   0.   0.   0.41 0.53 0.   0.53 0.53 0.   0.   0.   0.   0.  ]\n",
      " [0.   0.56 0.   0.56 0.   0.   0.44 0.   0.   0.   0.   0.44 0.   0.  ]\n",
      " [0.58 0.   0.37 0.   0.   0.   0.   0.   0.   0.37 0.37 0.   0.37 0.37]]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import TfidfTransformer\n",
    "\n",
    "tfidf = TfidfTransformer(use_idf=True,\n",
    "                         norm='l2',\n",
    "                         smooth_idf=True)\n",
    "np.set_printoptions(precision=2)\n",
    "print(tfidf.fit_transform(count.fit_transform(docs)).toarray())\n",
    "#tfidf,就是可以把在所有文档中都大量出现的词的权重给降低了，比如这里的\"I\""
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "![](picture/tfidf的l2标准化.png)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### 清洗文本数据\n",
    "Porter分词算法,返回单词的词根"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "outputs": [
    {
     "data": {
      "text/plain": "['runer', 'like', 'rune', 'and', 'thu', 'they', 'run']"
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from nltk.stem.porter import PorterStemmer\n",
    "\n",
    "porter = PorterStemmer()\n",
    "\n",
    "\n",
    "def tokenizer_porter(text):\n",
    "    return [porter.stem(word) for word in text.split()]\n",
    "\n",
    "\n",
    "tokenizer_porter(\"runers like runing and thus they run\")\n",
    "\n",
    "\n",
    "def tokenizer(text):\n",
    "    return text.split()\n",
    "\n",
    "\n",
    "tokenizer_porter(\"runers like runing and thus they run\")"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package stopwords to /Users/apple/nltk_data...\n",
      "[nltk_data]   Unzipping corpora/stopwords.zip.\n"
     ]
    }
   ],
   "source": [
    "#去除停用词\n",
    "import nltk\n",
    "\n",
    "nltk.download(\"stopwords\")\n",
    "from nltk.corpus import stopwords"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "outputs": [
    {
     "data": {
      "text/plain": "['runer', 'like', 'run', 'run', 'lot']"
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "stop = stopwords.words(\"english\")\n",
    "[w for w in tokenizer_porter(\"a runer likes running and runs a lot\"\n",
    "                             ) if w not in stop]"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "outputs": [],
   "source": [
    "#stop=stopwords.words(\"chinse\"),没有中文\n",
    "#训练逻辑回归模型\n",
    "movie_data = movie_data[[\"review\", \"sentiment\"]]\n",
    "X_train = movie_data.loc[:25000, 'review'].values\n",
    "y_train = movie_data.loc[:25000, 'sentiment'].values\n",
    "X_test = movie_data.loc[25000:, 'review'].values\n",
    "y_test = movie_data.loc[25000:, 'sentiment'].values\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Fitting 5 folds for each of 48 candidates, totalling 240 fits\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/feature_extraction/text.py:396: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', \"it'\", 'onc', 'onli', 'ourselv', \"she'\", \"should'v\", 'themselv', 'thi', 'veri', 'wa', 'whi', \"you'r\", \"you'v\", 'yourselv'] not in stop_words.\n",
      "  warnings.warn(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/feature_extraction/text.py:396: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', \"it'\", 'onc', 'onli', 'ourselv', \"she'\", \"should'v\", 'themselv', 'thi', 'veri', 'wa', 'whi', \"you'r\", \"you'v\", 'yourselv'] not in stop_words.\n",
      "  warnings.warn(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/feature_extraction/text.py:396: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', \"it'\", 'onc', 'onli', 'ourselv', \"she'\", \"should'v\", 'themselv', 'thi', 'veri', 'wa', 'whi', \"you'r\", \"you'v\", 'yourselv'] not in stop_words.\n",
      "  warnings.warn(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/feature_extraction/text.py:396: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', \"it'\", 'onc', 'onli', 'ourselv', \"she'\", \"should'v\", 'themselv', 'thi', 'veri', 'wa', 'whi', \"you'r\", \"you'v\", 'yourselv'] not in stop_words.\n",
      "  warnings.warn(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/feature_extraction/text.py:396: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', \"it'\", 'onc', 'onli', 'ourselv', \"she'\", \"should'v\", 'themselv', 'thi', 'veri', 'wa', 'whi', \"you'r\", \"you'v\", 'yourselv'] not in stop_words.\n",
      "  warnings.warn(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/feature_extraction/text.py:396: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', \"it'\", 'onc', 'onli', 'ourselv', \"she'\", \"should'v\", 'themselv', 'thi', 'veri', 'wa', 'whi', \"you'r\", \"you'v\", 'yourselv'] not in stop_words.\n",
      "  warnings.warn(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  n_iter_i = _check_optimize_result(\n",
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/feature_extraction/text.py:396: UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['abov', 'ani', 'becaus', 'befor', 'doe', 'dure', 'ha', 'hi', \"it'\", 'onc', 'onli', 'ourselv', \"she'\", \"should'v\", 'themselv', 'thi', 'veri', 'wa', 'whi', \"you'r\", \"you'v\", 'yourselv'] not in stop_words.\n",
      "  warnings.warn(\n"
     ]
    },
    {
     "ename": "KeyboardInterrupt",
     "evalue": "",
     "output_type": "error",
     "traceback": [
      "\u001B[0;31m---------------------------------------------------------------------------\u001B[0m",
      "\u001B[0;31mKeyboardInterrupt\u001B[0m                         Traceback (most recent call last)",
      "Input \u001B[0;32mIn [59]\u001B[0m, in \u001B[0;36m<cell line: 28>\u001B[0;34m()\u001B[0m\n\u001B[1;32m     19\u001B[0m lr_tfidf \u001B[38;5;241m=\u001B[39m Pipeline([(\u001B[38;5;124m'\u001B[39m\u001B[38;5;124mvect\u001B[39m\u001B[38;5;124m'\u001B[39m, tfidf),\n\u001B[1;32m     20\u001B[0m                      (\u001B[38;5;124m'\u001B[39m\u001B[38;5;124mclf\u001B[39m\u001B[38;5;124m'\u001B[39m, LogisticRegression(random_state\u001B[38;5;241m=\u001B[39m\u001B[38;5;241m0\u001B[39m))])\n\u001B[1;32m     22\u001B[0m gs_lr_tfidf \u001B[38;5;241m=\u001B[39m GridSearchCV(lr_tfidf, param_grid,\n\u001B[1;32m     23\u001B[0m                            scoring\u001B[38;5;241m=\u001B[39m\u001B[38;5;124m'\u001B[39m\u001B[38;5;124maccuracy\u001B[39m\u001B[38;5;124m'\u001B[39m,\n\u001B[1;32m     24\u001B[0m                            cv\u001B[38;5;241m=\u001B[39m\u001B[38;5;241m5\u001B[39m,\n\u001B[1;32m     25\u001B[0m                            verbose\u001B[38;5;241m=\u001B[39m\u001B[38;5;241m1\u001B[39m,\n\u001B[1;32m     26\u001B[0m                            n_jobs\u001B[38;5;241m=\u001B[39m\u001B[38;5;241m-\u001B[39m\u001B[38;5;241m1\u001B[39m)\n\u001B[0;32m---> 28\u001B[0m \u001B[43mgs_lr_tfidf\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mfit\u001B[49m\u001B[43m(\u001B[49m\u001B[43mX_train\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43my_train\u001B[49m\u001B[43m)\u001B[49m\n",
      "File \u001B[0;32m~/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/model_selection/_search.py:891\u001B[0m, in \u001B[0;36mBaseSearchCV.fit\u001B[0;34m(self, X, y, groups, **fit_params)\u001B[0m\n\u001B[1;32m    885\u001B[0m     results \u001B[38;5;241m=\u001B[39m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_format_results(\n\u001B[1;32m    886\u001B[0m         all_candidate_params, n_splits, all_out, all_more_results\n\u001B[1;32m    887\u001B[0m     )\n\u001B[1;32m    889\u001B[0m     \u001B[38;5;28;01mreturn\u001B[39;00m results\n\u001B[0;32m--> 891\u001B[0m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_run_search\u001B[49m\u001B[43m(\u001B[49m\u001B[43mevaluate_candidates\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m    893\u001B[0m \u001B[38;5;66;03m# multimetric is determined here because in the case of a callable\u001B[39;00m\n\u001B[1;32m    894\u001B[0m \u001B[38;5;66;03m# self.scoring the return type is only known after calling\u001B[39;00m\n\u001B[1;32m    895\u001B[0m first_test_score \u001B[38;5;241m=\u001B[39m all_out[\u001B[38;5;241m0\u001B[39m][\u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mtest_scores\u001B[39m\u001B[38;5;124m\"\u001B[39m]\n",
      "File \u001B[0;32m~/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/model_selection/_search.py:1392\u001B[0m, in \u001B[0;36mGridSearchCV._run_search\u001B[0;34m(self, evaluate_candidates)\u001B[0m\n\u001B[1;32m   1390\u001B[0m \u001B[38;5;28;01mdef\u001B[39;00m \u001B[38;5;21m_run_search\u001B[39m(\u001B[38;5;28mself\u001B[39m, evaluate_candidates):\n\u001B[1;32m   1391\u001B[0m     \u001B[38;5;124;03m\"\"\"Search all candidates in param_grid\"\"\"\u001B[39;00m\n\u001B[0;32m-> 1392\u001B[0m     \u001B[43mevaluate_candidates\u001B[49m\u001B[43m(\u001B[49m\u001B[43mParameterGrid\u001B[49m\u001B[43m(\u001B[49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mparam_grid\u001B[49m\u001B[43m)\u001B[49m\u001B[43m)\u001B[49m\n",
      "File \u001B[0;32m~/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/model_selection/_search.py:838\u001B[0m, in \u001B[0;36mBaseSearchCV.fit.<locals>.evaluate_candidates\u001B[0;34m(candidate_params, cv, more_results)\u001B[0m\n\u001B[1;32m    830\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39mverbose \u001B[38;5;241m>\u001B[39m \u001B[38;5;241m0\u001B[39m:\n\u001B[1;32m    831\u001B[0m     \u001B[38;5;28mprint\u001B[39m(\n\u001B[1;32m    832\u001B[0m         \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mFitting \u001B[39m\u001B[38;5;132;01m{0}\u001B[39;00m\u001B[38;5;124m folds for each of \u001B[39m\u001B[38;5;132;01m{1}\u001B[39;00m\u001B[38;5;124m candidates,\u001B[39m\u001B[38;5;124m\"\u001B[39m\n\u001B[1;32m    833\u001B[0m         \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124m totalling \u001B[39m\u001B[38;5;132;01m{2}\u001B[39;00m\u001B[38;5;124m fits\u001B[39m\u001B[38;5;124m\"\u001B[39m\u001B[38;5;241m.\u001B[39mformat(\n\u001B[1;32m    834\u001B[0m             n_splits, n_candidates, n_candidates \u001B[38;5;241m*\u001B[39m n_splits\n\u001B[1;32m    835\u001B[0m         )\n\u001B[1;32m    836\u001B[0m     )\n\u001B[0;32m--> 838\u001B[0m out \u001B[38;5;241m=\u001B[39m \u001B[43mparallel\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m    839\u001B[0m \u001B[43m    \u001B[49m\u001B[43mdelayed\u001B[49m\u001B[43m(\u001B[49m\u001B[43m_fit_and_score\u001B[49m\u001B[43m)\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m    840\u001B[0m \u001B[43m        \u001B[49m\u001B[43mclone\u001B[49m\u001B[43m(\u001B[49m\u001B[43mbase_estimator\u001B[49m\u001B[43m)\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m    841\u001B[0m \u001B[43m        \u001B[49m\u001B[43mX\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m    842\u001B[0m \u001B[43m        \u001B[49m\u001B[43my\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m    843\u001B[0m \u001B[43m        \u001B[49m\u001B[43mtrain\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mtrain\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m    844\u001B[0m \u001B[43m        \u001B[49m\u001B[43mtest\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mtest\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m    845\u001B[0m \u001B[43m        \u001B[49m\u001B[43mparameters\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mparameters\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m    846\u001B[0m \u001B[43m        \u001B[49m\u001B[43msplit_progress\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43msplit_idx\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mn_splits\u001B[49m\u001B[43m)\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m    847\u001B[0m \u001B[43m        \u001B[49m\u001B[43mcandidate_progress\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43mcand_idx\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mn_candidates\u001B[49m\u001B[43m)\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m    848\u001B[0m \u001B[43m        \u001B[49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[38;5;241;43m*\u001B[39;49m\u001B[43mfit_and_score_kwargs\u001B[49m\u001B[43m,\u001B[49m\n\u001B[1;32m    849\u001B[0m \u001B[43m    \u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m    850\u001B[0m \u001B[43m    \u001B[49m\u001B[38;5;28;43;01mfor\u001B[39;49;00m\u001B[43m \u001B[49m\u001B[43m(\u001B[49m\u001B[43mcand_idx\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mparameters\u001B[49m\u001B[43m)\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43m(\u001B[49m\u001B[43msplit_idx\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43m(\u001B[49m\u001B[43mtrain\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mtest\u001B[49m\u001B[43m)\u001B[49m\u001B[43m)\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;129;43;01min\u001B[39;49;00m\u001B[43m \u001B[49m\u001B[43mproduct\u001B[49m\u001B[43m(\u001B[49m\n\u001B[1;32m    851\u001B[0m \u001B[43m        \u001B[49m\u001B[38;5;28;43menumerate\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43mcandidate_params\u001B[49m\u001B[43m)\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[38;5;28;43menumerate\u001B[39;49m\u001B[43m(\u001B[49m\u001B[43mcv\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43msplit\u001B[49m\u001B[43m(\u001B[49m\u001B[43mX\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43my\u001B[49m\u001B[43m,\u001B[49m\u001B[43m \u001B[49m\u001B[43mgroups\u001B[49m\u001B[43m)\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m    852\u001B[0m \u001B[43m    \u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m    853\u001B[0m \u001B[43m\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m    855\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mlen\u001B[39m(out) \u001B[38;5;241m<\u001B[39m \u001B[38;5;241m1\u001B[39m:\n\u001B[1;32m    856\u001B[0m     \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mValueError\u001B[39;00m(\n\u001B[1;32m    857\u001B[0m         \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mNo fits were performed. \u001B[39m\u001B[38;5;124m\"\u001B[39m\n\u001B[1;32m    858\u001B[0m         \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mWas the CV iterator empty? \u001B[39m\u001B[38;5;124m\"\u001B[39m\n\u001B[1;32m    859\u001B[0m         \u001B[38;5;124m\"\u001B[39m\u001B[38;5;124mWere there no candidates?\u001B[39m\u001B[38;5;124m\"\u001B[39m\n\u001B[1;32m    860\u001B[0m     )\n",
      "File \u001B[0;32m~/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/joblib/parallel.py:1056\u001B[0m, in \u001B[0;36mParallel.__call__\u001B[0;34m(self, iterable)\u001B[0m\n\u001B[1;32m   1053\u001B[0m     \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_iterating \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;01mFalse\u001B[39;00m\n\u001B[1;32m   1055\u001B[0m \u001B[38;5;28;01mwith\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_backend\u001B[38;5;241m.\u001B[39mretrieval_context():\n\u001B[0;32m-> 1056\u001B[0m     \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mretrieve\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m   1057\u001B[0m \u001B[38;5;66;03m# Make sure that we get a last message telling us we are done\u001B[39;00m\n\u001B[1;32m   1058\u001B[0m elapsed_time \u001B[38;5;241m=\u001B[39m time\u001B[38;5;241m.\u001B[39mtime() \u001B[38;5;241m-\u001B[39m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_start_time\n",
      "File \u001B[0;32m~/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/joblib/parallel.py:935\u001B[0m, in \u001B[0;36mParallel.retrieve\u001B[0;34m(self)\u001B[0m\n\u001B[1;32m    933\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[1;32m    934\u001B[0m     \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mgetattr\u001B[39m(\u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_backend, \u001B[38;5;124m'\u001B[39m\u001B[38;5;124msupports_timeout\u001B[39m\u001B[38;5;124m'\u001B[39m, \u001B[38;5;28;01mFalse\u001B[39;00m):\n\u001B[0;32m--> 935\u001B[0m         \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_output\u001B[38;5;241m.\u001B[39mextend(\u001B[43mjob\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mget\u001B[49m\u001B[43m(\u001B[49m\u001B[43mtimeout\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mtimeout\u001B[49m\u001B[43m)\u001B[49m)\n\u001B[1;32m    936\u001B[0m     \u001B[38;5;28;01melse\u001B[39;00m:\n\u001B[1;32m    937\u001B[0m         \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_output\u001B[38;5;241m.\u001B[39mextend(job\u001B[38;5;241m.\u001B[39mget())\n",
      "File \u001B[0;32m~/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/joblib/_parallel_backends.py:542\u001B[0m, in \u001B[0;36mLokyBackend.wrap_future_result\u001B[0;34m(future, timeout)\u001B[0m\n\u001B[1;32m    539\u001B[0m \u001B[38;5;124;03m\"\"\"Wrapper for Future.result to implement the same behaviour as\u001B[39;00m\n\u001B[1;32m    540\u001B[0m \u001B[38;5;124;03mAsyncResults.get from multiprocessing.\"\"\"\u001B[39;00m\n\u001B[1;32m    541\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:\n\u001B[0;32m--> 542\u001B[0m     \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[43mfuture\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mresult\u001B[49m\u001B[43m(\u001B[49m\u001B[43mtimeout\u001B[49m\u001B[38;5;241;43m=\u001B[39;49m\u001B[43mtimeout\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m    543\u001B[0m \u001B[38;5;28;01mexcept\u001B[39;00m CfTimeoutError \u001B[38;5;28;01mas\u001B[39;00m e:\n\u001B[1;32m    544\u001B[0m     \u001B[38;5;28;01mraise\u001B[39;00m \u001B[38;5;167;01mTimeoutError\u001B[39;00m \u001B[38;5;28;01mfrom\u001B[39;00m \u001B[38;5;21;01me\u001B[39;00m\n",
      "File \u001B[0;32m~/opt/anaconda3/envs/Deep-Learning/lib/python3.10/concurrent/futures/_base.py:441\u001B[0m, in \u001B[0;36mFuture.result\u001B[0;34m(self, timeout)\u001B[0m\n\u001B[1;32m    438\u001B[0m \u001B[38;5;28;01melif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_state \u001B[38;5;241m==\u001B[39m FINISHED:\n\u001B[1;32m    439\u001B[0m     \u001B[38;5;28;01mreturn\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m__get_result()\n\u001B[0;32m--> 441\u001B[0m \u001B[38;5;28;43mself\u001B[39;49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43m_condition\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43mwait\u001B[49m\u001B[43m(\u001B[49m\u001B[43mtimeout\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m    443\u001B[0m \u001B[38;5;28;01mif\u001B[39;00m \u001B[38;5;28mself\u001B[39m\u001B[38;5;241m.\u001B[39m_state \u001B[38;5;129;01min\u001B[39;00m [CANCELLED, CANCELLED_AND_NOTIFIED]:\n\u001B[1;32m    444\u001B[0m     \u001B[38;5;28;01mraise\u001B[39;00m CancelledError()\n",
      "File \u001B[0;32m~/opt/anaconda3/envs/Deep-Learning/lib/python3.10/threading.py:320\u001B[0m, in \u001B[0;36mCondition.wait\u001B[0;34m(self, timeout)\u001B[0m\n\u001B[1;32m    318\u001B[0m \u001B[38;5;28;01mtry\u001B[39;00m:    \u001B[38;5;66;03m# restore state no matter what (e.g., KeyboardInterrupt)\u001B[39;00m\n\u001B[1;32m    319\u001B[0m     \u001B[38;5;28;01mif\u001B[39;00m timeout \u001B[38;5;129;01mis\u001B[39;00m \u001B[38;5;28;01mNone\u001B[39;00m:\n\u001B[0;32m--> 320\u001B[0m         \u001B[43mwaiter\u001B[49m\u001B[38;5;241;43m.\u001B[39;49m\u001B[43macquire\u001B[49m\u001B[43m(\u001B[49m\u001B[43m)\u001B[49m\n\u001B[1;32m    321\u001B[0m         gotit \u001B[38;5;241m=\u001B[39m \u001B[38;5;28;01mTrue\u001B[39;00m\n\u001B[1;32m    322\u001B[0m     \u001B[38;5;28;01melse\u001B[39;00m:\n",
      "\u001B[0;31mKeyboardInterrupt\u001B[0m: "
     ]
    }
   ],
   "source": [
    "tfidf = TfidfVectorizer(strip_accents=None,\n",
    "                        lowercase=False,\n",
    "                        preprocessor=None)\n",
    "\n",
    "param_grid = [{'vect__ngram_range': [(1, 1)],\n",
    "               'vect__stop_words': [stop, None],\n",
    "               'vect__tokenizer': [str.split, str.split],\n",
    "               'clf__penalty': ['l1', 'l2'],\n",
    "               'clf__C': [1.0, 10.0, 100.0]},\n",
    "              {'vect__ngram_range': [(1, 1)],\n",
    "               'vect__stop_words': [stop, None],\n",
    "               'vect__tokenizer': [tokenizer, tokenizer_porter],\n",
    "               'vect__use_idf': [False],\n",
    "               'vect__norm': [None],\n",
    "               'clf__penalty': ['l1', 'l2'],\n",
    "               'clf__C': [1.0, 10.0, 100.0]},\n",
    "              ]\n",
    "\n",
    "lr_tfidf = Pipeline([('vect', tfidf),\n",
    "                     ('clf', LogisticRegression(random_state=0))])\n",
    "\n",
    "gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,\n",
    "                           scoring='accuracy',\n",
    "                           cv=5,\n",
    "                           verbose=1,\n",
    "                           n_jobs=-1)\n",
    "\n",
    "gs_lr_tfidf.fit(X_train, y_train)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "outputs": [
    {
     "data": {
      "text/plain": "['The',\n 'warmest,',\n 'most',\n 'engaging',\n 'movie',\n 'of',\n 'its',\n 'genre,',\n 'Those',\n 'Lips,',\n 'Those',\n 'Eyes,',\n 'made',\n 'me',\n 'smile',\n 'and',\n 'cry',\n 'as',\n 'it',\n 'reminded',\n 'me',\n 'of',\n 'the',\n 'work',\n 'it',\n 'takes',\n 'to',\n 'pursue',\n 'a',\n 'dream',\n 'and',\n 'the',\n 'pain',\n 'of',\n 'disappointment.',\n 'Hulce',\n 'and',\n 'Langella',\n 'are',\n 'superb',\n 'and',\n 'the',\n 'story',\n 'seems',\n 'to',\n 'write',\n 'itself.',\n 'A',\n 'brilliant',\n 'screenplay',\n 'by',\n 'David',\n 'Shaber',\n '(one',\n 'of',\n 'my',\n 'favorites!',\n '-',\n 'see',\n 'The',\n 'Warriors',\n 'and',\n 'Nighthawks',\n 'for',\n 'more...)',\n 'and',\n 'beautiful',\n 'sets',\n 'filmed',\n 'on',\n 'location',\n '(I',\n 'think)',\n 'at',\n 'the',\n 'actual',\n 'summer',\n 'theater',\n 'in',\n 'which',\n 'the',\n 'story',\n 'takes',\n 'place.',\n 'You',\n \"can't\",\n 'see',\n 'this',\n 'movie',\n 'and',\n 'not',\n 'want',\n 'to',\n 'drop',\n 'everything',\n 'and',\n 'get',\n 'into',\n 'the',\n 'theater!',\n 'Please',\n 'check',\n 'this',\n 'video',\n 'out',\n 'if',\n 'you',\n 'can',\n 'find',\n 'it.']"
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#最优的参数\n",
    "print(\"Best parameter set :%s\" % gs_lr_tfidf.best_params_)\n",
    "#分类准确率\n",
    "print('CV Accuracy:%.3f' % gs_lr_tfidf.bset_score)\n",
    "#测试集上的准确率\n",
    "clf = gs_lr_tfidf.best_estimator_\n",
    "print(\"Test Accuracy:%.3f\" % clf.score(X_test, y_test))"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### 处理更大的数据集\n",
    "在线算法和核心学习\n",
    "从上一节执行过的代码示例中你可能已经注意到，在做网格搜索时，为50 000个电影评论数据集构造特征向量的代价可能非常昂贵。在许多实际应用中，超出计算机内存的规模处理更大数据集的情况并不少见。但并不是每个人都能访问超级计算机，所以现在将采用一种被称为“核外学习”的技术，这种技术可以通过对数据集的小批增量来模拟分类器完成大型数据集的处理工作。本节将用scikit-learn的SGDClassifier的partial_fit函数从本地驱动器直接获取流式文件，并用文件的小批次文档训练逻辑回归模型。"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "outputs": [],
   "source": [
    "import re\n",
    "from nltk.corpus import stopwords\n",
    "\n",
    "stop = stopwords.words('english')\n",
    "\n",
    "\n",
    "def tokenizer(text):\n",
    "    text = re.sub('<[^>]*>', '', text)\n",
    "    emoticons = re.findall('(?::|;|=)(?:-)?(?:\\)|\\(|D|P)', text.lower())\n",
    "    text = re.sub('[\\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', '')\n",
    "    tokenized = [w for w in text.split() if w not in stop]\n",
    "    return tokenized\n",
    "\n",
    "\n",
    "def stream_docs(path):\n",
    "    with open(path, 'r', encoding='utf-8') as csv:\n",
    "        next(csv)  # skip header\n",
    "        for line in csv:\n",
    "            text, label = line[:-3], int(line[-2])\n",
    "            yield text, label  #生成器，每调用一次他就会产生一行数据，和标签"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "outputs": [
    {
     "data": {
      "text/plain": "('\"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />\"\"Murder in Greenwich\"\" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich family used their influence to cover the murder for more than twenty years. However, a snoopy detective and convicted perjurer in disgrace was able to disclose how the hideous crime was committed. The screenplay shows the investigation of Mark and the last days of Martha in parallel, but there is a lack of the emotion in the dramatization. My vote is seven.<br /><br />Title (Brazil): Not Available\"',\n 1)"
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "next(stream_docs(path=\"data/movie_data.csv\"))"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "outputs": [],
   "source": [
    "def get_minibacth(doc_strem, size):\n",
    "    docs, y = [], []\n",
    "    try:\n",
    "        for s in range(size):\n",
    "            text, label = next(doc_strem)\n",
    "            docs.append(text)\n",
    "            y.append(label)\n",
    "    except StopIteration:\n",
    "        docs, y = None, None\n",
    "    return docs, y"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "不幸的是，因为需要把全部的单词保存在内存，所以无法调用CountVectorizer函数做核心学习。另外，TfidfVectorizer需要把训练集的所有特征向量保存在内存，以计算逆文档频率。然而，scikit-learn实现的另一个有用的向量化工具是HashingVectorizer，该工具用于文本处理并且独立于数据，调用由奥斯汀·爱珀白提出的基于哈希技术的32位MurmurHash3函数（https://sites.google.com/site/murmurhash/）"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction.text import HashingVectorizer\n",
    "from sklearn.linear_model import SGDClassifier\n",
    "vec=HashingVectorizer(decode_error='ignore',\n",
    "                      n_features=2*21,\n",
    "                      preprocessor=None,\n",
    "                      tokenizer=tokenizer)\n",
    "clf=SGDClassifier(loss='log',random_state=1,n_iter_no_change=1)\n",
    "doc_strem=stream_docs('data/movie_data.csv')"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Warning: No valid output stream.\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/var/folders/b6/sj0f93ps3rn4fdfsd4mkgj_00000gn/T/ipykernel_76665/3068537521.py:4: TqdmDeprecationWarning: This function will be removed in tqdm==5.0.0\n",
      "Please use `tqdm.notebook.*` instead of `tqdm._tqdm_notebook.*`\n",
      "  from tqdm._tqdm_notebook import tqdm_notebook\n"
     ]
    },
    {
     "data": {
      "text/plain": "  0%|          | 0/45 [00:00<?, ?it/s]",
      "application/vnd.jupyter.widget-view+json": {
       "version_major": 2,
       "version_minor": 0,
       "model_id": "31b13f11085b4f03aa792a4acf1cbf24"
      }
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import pyprind\n",
    "import numpy as np\n",
    "pbar= pyprind.ProgBar(45)\n",
    "from tqdm._tqdm_notebook import tqdm_notebook\n",
    "classs=np.array([0,1])\n",
    "for _ in tqdm_notebook(range(45)):\n",
    "    X_train,y_train=get_minibacth(doc_strem,1000)\n",
    "    if not X_train:\n",
    "        break\n",
    "        print(_)\n",
    "    X_train=vec.transform(X_train)\n",
    "    clf.partial_fit(X_train,y_train,classes=classs)\n",
    "    pbar.update()"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "outputs": [],
   "source": [
    "X_test,y_test=get_minibacth(doc_strem,size=50)\n",
    "X_test=vec.transform(X_test)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Accuracy:  1\n"
     ]
    }
   ],
   "source": [
    "print(\"Accuracy:%3.f\"%clf.score(X_test,y_test))"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "outputs": [],
   "source": [
    "clf=clf.partial_fit(X_test,y_test)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### 文本LDA主题建模\n",
    "![](picture/LDA.png)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "df=pd.read_csv(\"data/movie_data.csv\",encoding='utf-8')\n",
    "from sklearn.feature_extraction.text import CountVectorizer"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "outputs": [],
   "source": [
    "count=CountVectorizer(stop_words=\"english\",\n",
    "                      max_df=0.1,#最大词频设置为0.1，删除那些词频很大的单词\n",
    "                      max_features=5000)#最大的特征数量是5000，只做最常出现的5000个单词\n",
    "X=count.fit_transform(df['review'].values)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "outputs": [],
   "source": [
    "from sklearn.decomposition import LatentDirichletAllocation\n",
    "lda=LatentDirichletAllocation(n_components=10,\n",
    "                              random_state=123,\n",
    "                              learning_method=\"batch\")#通过设置参数learning_method='batch'，让lda评估器在一次迭代中根据所有可用的训练数据（词袋矩阵）进行估计，这比在线学习方法慢，但可能会带来更准确的预测结果（设置参数learning_method='online'，则与第2章以及本章讨论的在线或小批量学习类似）。"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "outputs": [],
   "source": [
    "X_topics=lda.fit_transform(X)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "#### scikit-learn的LDA实现采用期望最大化（EM）算法来迭代更新参数估计。本章尚未讨论EM算法，但如果想了解更多信息，请参阅维基百科上的精彩概述（https://en.wikipedia.org/wiki/Expectation-maximization_algorithm）以及科罗拉多里德讲解的如何使用LDA的详细教程《LDA：迈向更深入的了解》，该教程可以免费从下述网站获得：http://obphio.us/pdfs/lda_tutorial.pdf。拟合LDA之后，现在可以访问lda实例的components_属性，该属性存储包含每10个主题的按升序排列的单词重要性（此处为5000）的矩阵。"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "outputs": [
    {
     "data": {
      "text/plain": "(10, 5000)"
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lda.components_.shape\n",
    "#每个主题有5000个按重要程度排序的单词，单词数量就是特征数量"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Topics: 1\n",
      "worst minutes awful script stupid\n",
      "Topics: 2\n",
      "family mother father children girl\n",
      "Topics: 3\n",
      "american war dvd music tv\n",
      "Topics: 4\n",
      "human audience cinema art sense\n",
      "Topics: 5\n",
      "police guy car dead murder\n",
      "Topics: 6\n",
      "horror house sex girl woman\n",
      "Topics: 7\n",
      "role performance comedy actor performances\n",
      "Topics: 8\n",
      "series episode war episodes tv\n",
      "Topics: 9\n",
      "book version original read novel\n",
      "Topics: 10\n",
      "action fight guy guys cool\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/apple/opt/anaconda3/envs/Deep-Learning/lib/python3.10/site-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.\n",
      "  warnings.warn(msg, category=FutureWarning)\n"
     ]
    }
   ],
   "source": [
    "n_top_words=5\n",
    "feature_names=count.get_feature_names()#5000个单词，即是特征名字\n",
    "for idx,topic in enumerate(lda.components_):#5000个单词对应的tf-idf\n",
    "    print(\"Topics: %d\"%(idx+1))\n",
    "    print(\" \".join([feature_names[i] for i in topic.argsort()[:-n_top_words-1:-1]]))#排序后的倒数五个单词,因为默认是升序"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "outputs": [
    {
     "data": {
      "text/plain": "array([[0.0011114 , 0.19959939, 0.00111146, ..., 0.00111146, 0.00111178,\n        0.00111127],\n       [0.9857103 , 0.00158776, 0.00158769, ..., 0.00158763, 0.0015877 ,\n        0.00158774],\n       [0.38619934, 0.43091153, 0.106478  , ..., 0.00119069, 0.00119094,\n        0.00119079],\n       ...,\n       [0.44586093, 0.0027788 , 0.00277831, ..., 0.00277824, 0.10483994,\n        0.09596916],\n       [0.00384681, 0.53244645, 0.00384701, ..., 0.00384698, 0.00384675,\n        0.00384695],\n       [0.00833551, 0.48130427, 0.00833415, ..., 0.00833491, 0.00833449,\n        0.45201744]])"
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_topics#50000条影评，被分为10类电影的概率"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Horror movie review #1\n",
      "House of Dracula works from the same basic premise as House of Frankenstein from the year before; namely that Universal's three most famous monsters; Dracula, Frankenstein's Monster and The Wolf Man are appearing in the movie together. Naturally, the film is rather messy therefore, but the fact that ...\n",
      "\n",
      "Horror movie review #2\n",
      "Okay, what the hell kind of TRASH have I been watching now? \"The Witches' Mountain\" has got to be one of the most incoherent and insane Spanish exploitation flicks ever and yet, at the same time, it's also strangely compelling. There's absolutely nothing that makes sense here and I even doubt there  ...\n",
      "\n",
      "Horror movie review #3\n",
      "<br /><br />Horror movie time, Japanese style. Uzumaki/Spiral was a total freakfest from start to finish. A fun freakfest at that, but at times it was a tad too reliant on kitsch rather than the horror. The story is difficult to summarize succinctly: a carefree, normal teenage girl starts coming fac ...\n"
     ]
    }
   ],
   "source": [
    "#选出几段对第六个主题的电影的描述\n",
    "horror=X_topics[:,5].argsort()[::-1]#倒序，就是最有可能分为恐怖电影的影评的概率从大到小\n",
    "for idx,mo in enumerate(horror[:3]):\n",
    "    print(\"\\nHorror movie review #%d\"%(idx+1))\n",
    "    print(df[\"review\"][mo][:300],'...')"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "### 本章学习了如何利用机器学习算法根据文本文档的倾向性对其分类，这是NLP领域情感分析的基本任务。本章不仅学习了如何使用词袋模型将文档编码为特征向量，而且还学习了如何使用tf-idf通过相关性对词频进行加权。由于在该过程中创建了大量特征向量，处理这样的文本数据在计算上可能非常昂贵。上一节学习了如何利用外部存储或增量学习来训练机器学习算法，而不需要将整个数据集加载到计算机内存。最后引入了LDA主题建模的概念，以无监督的方式将电影评论分为不同的类别。下一章中我们将使用自己实现的文档分类器并学习如何将其嵌入到网络应用。"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}